Here we approach the two-class classification problem in a direct way: we try to find a plane that separates the two classes in feature space.
If we cannot, we get creative in two ways:
We soften what we mean by “separates”
We enrich and enlarge the feature space so that separation is possible
A hyperplane in \(p\) dimensions is a flat affine subspace of dimension \(p − 1\)
In general the equation for a hyperplane has the form \(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p = 0\)
In \(p=2\) dimensions a hyperplane is a line
If \(\beta_0=0\), the hyperplane goes through the origin, otherwise not
The vector \(\beta = (\beta_1, \beta_2, ..., \beta_p)\) is called the normal vector
Hyperplane in 2 Dimensions
If \(f(X) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p\), then \(f(X) > 0\) for points on one side of the hyperplane, and \(f(X) <0\) for points on the other
If we code the colored points as \(Y_i = +1\) for blue and \(Y_i = -1\) for mauve, then if \(Y_i \cdot f(X_i) > 0\) for all \(i\), \(f(X) = 0\) defines a separating hyperplane
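As a concrete illustration of this sign check, here is a minimal sketch (plain NumPy, with made-up 2-D data and hand-picked coefficients \(\beta_0, \beta\)) that tests whether a candidate hyperplane separates the two classes:

```python
import numpy as np

# Made-up 2-D data: three "blue" points coded +1, three "mauve" points coded -1.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.5], [-0.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A candidate hyperplane f(X) = beta0 + beta1*X1 + beta2*X2, coefficients chosen by hand.
beta0, beta = 0.0, np.array([1.0, 1.0])

f = beta0 + X @ beta            # f(x_i) for every observation
separates = np.all(y * f > 0)   # y_i * f(x_i) > 0 for all i  <=>  separating hyperplane
print(f, separates)
```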
Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes
Constrained optimization problem (the maximal margin classifier):
\(\underset{\beta_0, \beta_1,...,\beta_p}{\text{maximize}}\; M\)
subject to \(\sum_{j=1}^{p} \beta_j^2=1\) and \(y_i(\beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip}) \geq M\) for all \(i = 1,...,n\)
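In practice this optimization is rarely coded by hand. A common shortcut (an approximation, not the exact problem above) is to fit scikit-learn's `SVC` with a linear kernel and a very large cost parameter, which effectively forbids margin violations on separable data; with `SVC`'s scaling of the coefficients, the margin works out to \(1/\lVert\beta\rVert\). The data below are made up:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.5], [-0.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large cost makes violations prohibitively expensive: a hard(ish) margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta, beta0 = clf.coef_[0], clf.intercept_[0]
# SVC scales beta so that y_i * f(x_i) >= 1 on the margin, hence margin M = 1 / ||beta||.
M = 1.0 / np.linalg.norm(beta)
print(beta, beta0, M)
```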
In practice the data are often not separable by a linear boundary; indeed this is usually the case, unless \(N < p\)
Sometimes the data are separable, but noisy. This can lead to a poor solution for the maximal-margin classifier
The support vector classifier maximizes a soft margin
\(\underset{\beta_0, \beta_1,...,\beta_p, \epsilon_1,...,\epsilon_n}{\text{maximize}}\; M \ \ \text{subject to} \ \ \sum_{j=1}^p\beta_j^2 = 1\),
\(y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_px_{ip}) \geq M(1-\epsilon_i)\),
\(\epsilon_i \geq 0\), \(\sum_{i=1}^n \epsilon_i \leq C\),
where \(C\) is a regularization parameter: the total budget for the slack variables \(\epsilon_i\), i.e. for how much the margin can be violated
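A rough sketch of the effect of this budget, using scikit-learn on made-up overlapping data. Note the caveat in the comments: scikit-learn's `C` is a *cost* on violations, so it behaves roughly as the inverse of the budget \(C\) above.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping, made-up classes, so some slack variables eps_i must be positive.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# Caveat: scikit-learn's C is a cost on margin violations, roughly the *inverse* of the
# budget C above: small sklearn C ~ large budget (wide, soft margin); large sklearn C
# ~ small budget (narrow, hard-ish margin).
for cost in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=cost).fit(X, y)
    print(cost, clf.n_support_.sum())  # support-vector count typically shrinks as cost grows
```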
Sometimes a linear boundary simply won't work, no matter what value of \(C\) we use
What to do?
Feature expansion
Enlarge the space of features by including transformations, e.g. \(X_1^2, X_1^3, X_1X_2, X_1X_2^2,...\)
Fit a support-vector classifier in the enlarged space
This results in non-linear decision boundaries in the original space
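A minimal sketch of explicit feature expansion, assuming scikit-learn and made-up ring-shaped data that no linear boundary in \((X_1, X_2)\) can separate:

```python
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Made-up data: one class inside a ring of the other; not linearly separable in (X1, X2).
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

# Enlarge the feature space with degree-2 terms (X1^2, X1*X2, X2^2, ...),
# then fit a *linear* support vector classifier in the enlarged space.
expanded = make_pipeline(PolynomialFeatures(degree=2), LinearSVC(C=1.0))
print(expanded.fit(X, y).score(X, y))  # linear in the enlarged space, non-linear in (X1, X2)
```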
Polynomials (especially high-dimensional ones) get wild rather fast
There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers — through the use of kernels
Before we discuss these, we must understand the role of inner products in support-vector classifiers
\(\left<x_i, x_{i'}\right> = \sum_{j=1}^{p} x_{ij}x_{i'j}\)
The linear support vector classifier can be represented as: \(f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i\left<x,x_i \right>\)
To estimate the parameters \(\alpha_1, ..., \alpha_n\) and \(\beta_0\), all we need are the \(\binom{n}{2}\) inner products \(\left<x_i,x_{i'} \right>\) between all pairs of training observations
It turns out that most of the \(\hat{\alpha}_i\) can be zero: \(f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat{\alpha}_i \left<x,x_i \right>\), where \(\mathcal{S}\) is the support set of indices \(i\) with \(\hat{\alpha}_i > 0\) (the support points)
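As an illustration of this sparsity (a sketch with scikit-learn and made-up data): the fitted classifier only stores the points in \(\mathcal{S}\) and their coefficients.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Made-up two-class data.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the points in the support set S carry a non-zero alpha-hat; the rest drop out.
print(len(clf.support_), "of", len(X), "training points are support vectors")
print(clf.dual_coef_.shape)  # one coefficient (signed alpha-hat) per support vector
```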
If we can compute inner products between observations, we can fit an SV classifier. This can be quite abstract!
Some special kernel functions can do this for us:
\(K(x_i,x_{i'}) = (1 + \sum_{j=1}^{p} x_{ij}x_{i'j})^d\) — the polynomial kernel of degree \(d\); it computes the inner products needed for a degree-\(d\) polynomial basis expansion
The solution has the form: \(f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \hat{\alpha}_i K(x, x_i)\)
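The form of this solution can be checked numerically. The sketch below (scikit-learn, made-up data, degree \(d = 3\)) rebuilds \(f(x)\) from the fitted pieces, using the fact that with `gamma=1` and `coef0=1` scikit-learn's polynomial kernel is exactly \((1 + \left<x, x'\right>)^d\):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Made-up two-class data.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=2)

# Polynomial kernel of degree d = 3; with gamma=1 and coef0=1 this is (1 + <x, x'>)^3.
clf = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

def K(A, B):
    return (1.0 + A @ B.T) ** 3

# Rebuild f(x) = beta0 + sum_{i in S} alpha_hat_i * K(x, x_i) from the stored pieces
# (dual_coef_ holds the signed alpha-hats, support_vectors_ the points in S).
f_manual = K(X, clf.support_vectors_) @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(f_manual, clf.decision_function(X)))  # True: same decision values
```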
Radial kernel: \(K(x_i,x_{i'}) = \exp\left(-\gamma \sum_{j=1}^{p} (x_{ij}-x_{i'j})^2\right)\)
Implicit feature space; very high dimensional
Controls variance by squashing down most dimensions severely
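A rough illustration of that variance control, assuming scikit-learn and made-up ring-shaped data: \(\gamma\) sets how quickly the radial kernel decays, so large \(\gamma\) gives very local, wiggly boundaries and small \(\gamma\) gives smoother ones.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Made-up non-linear two-class data.
X, y = make_circles(n_samples=300, factor=0.5, noise=0.15, random_state=0)

# gamma scales the squared distance in exp(-gamma * ||x - x'||^2): larger gamma means
# a more local, higher-variance fit.
for gamma in (0.1, 1.0, 10.0, 100.0):
    scores = cross_val_score(SVC(kernel="rbf", gamma=gamma, C=1.0), X, y, cv=5)
    print(gamma, scores.mean())
```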
The SVM as defined works for \(K = 2\) classes. What do we do if we have \(K>2\) classes?
OVA (one versus all): fit \(K\) different 2-class SVM classifiers \(\hat{f}_k(x)\), \(k = 1,...,K\), each class versus the rest; classify \(x^*\) to the class for which \(\hat{f}_k(x^*)\) is largest
OVO (one versus one): fit all \(\binom{K}{2}\) pairwise classifiers \(\hat{f}_{k\ell}(x)\); classify \(x^*\) to the class that wins the most pairwise competitions
Which to choose? If \(K\) is not too large, use OVO
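Both schemes are available off the shelf in scikit-learn; a sketch on a made-up \(K = 4\) class problem (note that `SVC` already uses OVO internally when given more than two classes):

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Made-up 4-class problem.
X, y = make_blobs(n_samples=400, centers=4, cluster_std=1.5, random_state=3)

base = SVC(kernel="rbf", C=1.0, gamma=1.0)
ova = OneVsRestClassifier(base).fit(X, y)  # K classifiers: each class vs. the rest
ovo = OneVsOneClassifier(base).fit(X, y)   # K(K-1)/2 pairwise classifiers
print(ova.score(X, y), ovo.score(X, y))
```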
When classes are (nearly) separable, SVM does better than LR; so does LDA
When they are not, LR (with a ridge penalty) and SVM are very similar
If you wish to estimate probabilities, LR is the choice
For nonlinear boundaries, kernel SVMs are popular. Kernels can be used with LR and LDA as well, but the computations are more expensive
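A small sketch of the probability point, assuming scikit-learn and made-up data: logistic regression gives class probabilities directly, while `SVC` only produces them via an extra Platt-scaling fit on top of its margin scores.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Made-up two-class data with some overlap.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=4)

# Ridge-penalized logistic regression models class probabilities directly.
lr = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print(lr.predict_proba(X[:3]))

# SVC gives probabilities only through an extra internal (Platt-scaling) fit.
svm = SVC(kernel="linear", C=1.0, probability=True).fit(X, y)
print(svm.predict_proba(X[:3]))
```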